ABSTRACT


Heart rate (HR) is an important indicator for human physiological
diagnosis, and a camera can be used to measure it via photoplethysmography
(PPG) signal extraction. In doing so, the number of sample images required
to measure the HR signal and the quality of the images themselves are
important for an accurate reading. This paper addresses this issue by
analyzing the effect of the sampling interval on the HR reading in compressed
and original video formats, recorded at various distances. Technically,
important facial points in the video stream were estimated using a cascade
regression facial tracker. Based on the facial points, a region of interest
(ROI) was constructed where non-rigid movement is minimal. Next, the PPG
signal was extracted by calculating the average green pixel intensity
in the ROI. Following that, illumination variation was separated
from the signal via independent component analysis (ICA). The PPG signal
was further processed using a series of signal filtering techniques to
exclude frequencies outside the range of interest prior to estimating the HR.
The experiments show that a sampling time of 2 seconds in uncompressed
video gives promising HR readings at distances of 1 to 5 meters.


This is an open access article under the CC BY-SA license.


Corresponding Author: 


Mohd Razali Md Tomar, 

EMCenter, Universiti Tun Hussein Onn Malaysia, 
86400, Parit Raja, Batu Pahat, Johor, Malaysia. 
Email: mdrazah@uthm.edu.my 


1. INTRODUCTION 

Over the past years, many methods have been introduced to monitor human HR, since it is closely related
to human physiological condition. Recently, interest has concentrated on non-contact HR monitoring, which
is especially useful for patients with burned skin, elderly people with fragile skin and premature infants
with extremely sensitive skin. One of the most cost-effective non-contact devices is the camera, which
measures the HR via PPG signal extraction. PPG is a simple non-contact optical measurement technique that
can measure pulse activity connected to the human cardiac system from blood flow due to muscle
contraction [1]. It was introduced back in 1973 by Hertzman et al. [2], who showed that the variation in light
transmission through a finger could be detected by a photoelectric cell. Based on this initial work, further
research found that video of the human face recorded by a normal camera under ambient light contains
a signal rich enough to measure the HR [3]. Examples of works that use the PPG signal to
measure HR with colour-based methods via a web camera can be found in [4-8]. In the camera-based HR
domain, it has also been reported that, instead of using three colour channels (RGB), the single green channel
provides a more accurate outcome for PPG signal extraction, because haemoglobin light absorption is most
sensitive to oxygenation changes for green light compared to blue and red light [5, 9]. Another interesting
line of work is pulse detection via head motion [10], which showed promising results
in translating subtle head motion into an HR estimate.

Almost all of the mentioned works yield promising results, but their main concerns revolve only
around motion artifacts [11-14] and illumination variation [2-5]. They did not consider the sampling time
requirements and video compression, which affect the accuracy of the HR reading. Recently, Yu et al.
reported on the minimum recording time required of the input video
for an image-based monitoring system [15]. Based on their experiments, they stated that if a longer video
duration is used as input for an image-based monitoring system, the pulse reading accuracy deteriorates.
However, they did not consider the case where the video is compressed, which is especially relevant
in surveillance camera applications. When the video is compressed, the image quality of the face is
degraded, which consequently affects the PPG signal, especially its shape [16]. Meanwhile, McDuff
et al. stated that video compression degrades the signal-to-noise ratio (SNR) of the PPG signal, thus affecting
the accuracy of the HR reading [17]. Work by Zhao et al. also suggests that the PPG signal
amplitude, SNR and signal trace deteriorate due to video compression [18]. These findings about video
duration motivated us to analyse further by integrating the minimum sampling time requirement
with compressed video for non-contact HR measurement.


2. SYSTEM OVERVIEW 

The framework of this project consists of five main steps: facial detection and facial
tracking, raw PPG signal extraction from the green channel, illumination variation elimination using
ICA, signal filtering and histogram analysis. Figure 1 depicts the overall block diagram of the system.
Initially, facial detection was applied to the recorded videos to localize the human face. Next, a facial
tracker was applied to the detected face region to extract important facial points that are later used
during PPG signal extraction. The facial tracker produced 49 points based on prominent human facial
features, and based on these points the region for raw PPG signal extraction was labelled. The PPG signal
was then extracted from the labelled region using temporal random trace information of the green channel,
since the green channel has a good SNR [19]. The extracted raw PPG signal contains unwanted noise due
to environmental illumination and motion artifacts. To address this, a combination of Independent Component
Analysis (ICA) [20] and a series of signal filters was applied to the raw signal, making it
smoother and easier to work with. The refined PPG signal was then converted to the frequency domain
to determine the Power Spectral Density (PSD), which is used for the HR calculation.

Figure 1. Overall system plan


However, relying on a single sequence of HR readings is still subject to measurement variation.
To overcome this, a histogram analysis of repeated HR readings was constructed based on the same ROI with
different random traces. Eventually, the HR is estimated from the average of the histogram with the lowest
variation in readings.


2.1. Facial detection

Facial detection and facial landmarking were used to locate the targeted subject and their prominent
facial features. In this project, Viola-Jones (VJ) based facial detection using an AdaBoost-based cascade
with Haar-like features [21] is employed. This method constructs a strong classifier
as a linear combination of weak classifiers, trained on positive (face) and negative (non-face) images.
During detection, a series of classifiers is applied to every image sub-window at different scaling factors.
Regions are considered valid if they pass through all the classifier stages. As for the facial tracker,
a combination of Discriminative Response Map Fitting (DRMF) and Monte Carlo parallel linear regression [22]
was used. The method works by making a rough initial guess of the facial landmark positions and using
a cascade of regressors to infer the shape as a whole, explicitly minimizing the alignment error over
the training data. The mathematical model of the facial tracker is shown in (1), where S = (x1, x2, ..., xp)
denotes the coordinates of all p facial landmarks in a bounding box I and r_t(...) is the regressor cascade.


$S^{(t+1)} = S^{(t)} + r_t(I, S^{(t)})$ (1)
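The cascade update in (1) can be illustrated with a minimal sketch. It is not the DRMF tracker used in this work: the regressors below are hypothetical stand-ins that simply move each landmark halfway towards a known target, and the names `run_cascade` and `make_toy_regressor` are introduced only for illustration.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Shape = List[Point]
# A regressor maps (image, current shape) to a per-landmark update.
Regressor = Callable[[object, Shape], Shape]


def run_cascade(image: object, initial_shape: Shape,
                regressors: List[Regressor]) -> Shape:
    """Apply S(t+1) = S(t) + r_t(I, S(t)) for each regressor in turn."""
    shape = list(initial_shape)
    for r_t in regressors:
        update = r_t(image, shape)
        shape = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(shape, update)]
    return shape


def make_toy_regressor(target: Shape, step: float) -> Regressor:
    """Hypothetical stand-in for a trained regressor: it moves each
    landmark a fixed fraction of the way towards a known target."""
    def regressor(image: object, shape: Shape) -> Shape:
        return [(step * (tx - x), step * (ty - y))
                for (x, y), (tx, ty) in zip(shape, target)]
    return regressor


target = [(10.0, 20.0), (30.0, 40.0)]   # hypothetical true landmarks
initial = [(0.0, 0.0), (0.0, 0.0)]      # rough initial guess
cascade = [make_toy_regressor(target, 0.5) for _ in range(10)]
final_shape = run_cascade(None, initial, cascade)
```

Each stage halves the remaining alignment error here, so ten stages bring the shape close to the target. A trained cascade learns its updates from image features instead of from a known target.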


This facial tracker produces 49 facial points based on important human features, which include
rigid and non-rigid points. A sample of the 49 detected facial landmarks is shown in Figure 2,
where the left side shows the numbered facial points and the right side shows the result
on an actual image. From the detected 49 points, four ROIs were selected to determine the
face area that yields the most accurate HR for near- and long-distance applications. The chosen
face areas are shown in Figure 3. The right cheek was chosen as the first ROI since less non-rigid motion is
generated in this area compared to other regions. Another area selected was the center of the face, because
a study claimed that this area is the most suitable for PPG signal extraction in video-based HR
systems. Next, the whole face area was selected since, hypothetically, a larger ROI increases the possibility
of extracting the PPG information from a far distance. However, a 10% horizontal dimension reduction
and a 20% vertical dimension reduction were applied to this area in order to exclude unwanted background that
might affect the PPG signal extraction process. Lastly, the final area selected for this project was the lower
region of the face, which includes the nose and mouth but excludes the eyes and chin. This area was selected
because it has less non-rigid motion and is wider than the right cheek region. From
the selected ROI, random pairs of temporal green color channel values that represent the PPG signal are extracted.

Figure 3. Region of interest
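The whole-face ROI reduction described above can be sketched as follows. The text does not state how the 10% and 20% reductions are distributed, so this sketch assumes they are applied symmetrically about the centre of the landmark bounding box; the function names and the sample landmark coordinates are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def bounding_box(points: Sequence[Tuple[float, float]]) -> Box:
    """Axis-aligned box that encloses all landmark points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def shrink_box(box: Box, horizontal: float = 0.10,
               vertical: float = 0.20) -> Box:
    """Shrink a box about its centre (assumed symmetric reduction)."""
    x, y, w, h = box
    new_w = w * (1.0 - horizontal)
    new_h = h * (1.0 - vertical)
    return (x + (w - new_w) / 2.0, y + (h - new_h) / 2.0, new_w, new_h)


# Hypothetical landmark positions in pixel coordinates.
landmarks = [(100.0, 50.0), (300.0, 120.0), (180.0, 350.0)]
face_box = bounding_box(landmarks)
roi = shrink_box(face_box)
```

For the sample points the face box is 200 by 300 pixels, and the reduced ROI is 180 by 240 pixels with the same centre.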


2.2. PPG signal extraction

The PPG signal was extracted from the green channel of the constructed ROI because the green channel
contains the strongest information for signal extraction, owing to the light sensitivity of hemoglobin.
Since the extracted signal contained unwanted interference, mainly illumination variation, a blind source
separation (BSS) technique known as ICA was used to separate the illumination variation from the true PPG
signal. The signal extraction is visualized in Figure 4, where the top part is the original raw signal
and the bottom part shows the two signals produced by the ICA: the refined signal and its illumination
variation. It can be observed that the pattern of the refined signal is more or less identical to the raw
one, as opposed to the predicted noise.
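As a minimal sketch of the extraction step, the code below averages the green intensity inside the ROI for each frame, following the description in the abstract. The random trace selection mentioned in Section 2 is not reproduced, and the frame representation (nested sequences of RGB tuples) and function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]        # (R, G, B)
Frame = Sequence[Sequence[Pixel]]   # rows of pixels
Roi = Tuple[int, int, int, int]     # (x, y, width, height), inside the frame


def mean_green(frame: Frame, roi: Roi) -> float:
    """Average green intensity of the pixels inside the ROI."""
    x, y, w, h = roi
    total = 0
    for row in frame[y:y + h]:
        for pixel in row[x:x + w]:
            total += pixel[1]
    return total / (w * h)


def green_trace(frames: Sequence[Frame], roi: Roi) -> List[float]:
    """One sample per frame: the raw green-channel signal over time."""
    return [mean_green(frame, roi) for frame in frames]


# Two tiny hypothetical 2x2 frames.
frame_a = [[(0, 10, 0), (0, 20, 0)], [(0, 30, 0), (0, 40, 0)]]
frame_b = [[(0, 50, 0), (0, 50, 0)], [(0, 50, 0), (0, 50, 0)]]
trace = green_trace([frame_a, frame_b], (0, 0, 2, 2))
```

The resulting trace has one value per frame, so a 2-second clip at 60 FPS gives 120 samples for the later separation and filtering steps.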

The PPG signal can be modelled using (2), where PPGraw is the true PPG signal, s is the green
channel signal and y is the illumination variation. If y could be estimated directly, the pure
cardiac signal could be obtained easily; in practice, however, it cannot be measured directly,
and hence ICA is used. ICA uncovers the independent sources of the signal by maximizing or minimizing
a cost function that measures the non-Gaussianity of the mixture, which yields the mixture coefficients.


$PPG_{raw} = s + y$ (2)
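For reference, ICA is commonly written as a linear mixing model. The block below is a generic textbook formulation, not necessarily the exact one used in this work; the symbols x, A, W and z, and the choice of kurtosis as the non-Gaussianity measure, are assumptions introduced here for illustration.

```latex
% Observed traces x(t) are modelled as an unknown linear mixture of
% independent sources z(t), e.g. a cardiac component and an
% illumination component. ICA estimates an unmixing matrix W.
\mathbf{x}(t) = A\,\mathbf{z}(t), \qquad
\hat{\mathbf{z}}(t) = W\,\mathbf{x}(t), \qquad W \approx A^{-1}

% One common non-Gaussianity measure for a zero-mean,
% unit-variance component u (zero for a Gaussian variable):
\operatorname{kurt}(u) = \mathbb{E}\!\left[u^{4}\right] - 3
```

Under this view, each row of W is chosen so that the corresponding recovered component is as non-Gaussian as possible.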





Figure 4. Visualization of the signal separation process using ICA (panel labels: PPG signal, illumination variation)


The PPG signal obtained after the ICA process is still a raw signal that contains residual unwanted
noise. Thus, signal filtering was applied to obtain a refined PPG signal. In this paper,
detrending and moving average filters were applied to reduce the slow, non-stationary trend
of the signal and to suppress random noise, making the signal smooth prior to frequency domain conversion
[23, 24]. The filtered PPG signal was converted to the frequency domain to determine the power spectral
density (PSD) using the Welch method [25], with the frequency spectrum constrained to the range of 0.7 Hz
to 4 Hz, which represents HR values from 42 bpm to 240 bpm. Lastly, the HR is calculated
by multiplying the frequency of the maximal PSD response by 60. The filtering processes for the PPG
signal are shown in Figure 5.

Figure 5. Detrending and moving average process
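The filtering and HR estimation chain can be sketched in plain Python as below. This is an illustration under stated assumptions, not the implementation used in the experiments: the detrending is a simple least-squares line removal, the moving-average width of 5 samples is arbitrary, and the Welch-style PSD uses Hann-windowed segments with 50% overlap evaluated on a dense 0.01 Hz grid. The test signal is synthetic.

```python
import math
from typing import List, Sequence


def detrend_linear(x: Sequence[float]) -> List[float]:
    """Remove the least-squares straight line from the signal."""
    n = len(x)
    t_mean = (n - 1) / 2.0
    x_mean = sum(x) / n
    num = sum((i - t_mean) * (v - x_mean) for i, v in enumerate(x))
    den = sum((i - t_mean) ** 2 for i in range(n))
    slope = num / den
    return [v - (x_mean + slope * (i - t_mean)) for i, v in enumerate(x)]


def moving_average(x: Sequence[float], width: int) -> List[float]:
    """Centred moving average; the window is truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


def welch_psd(x: Sequence[float], fs: float, freqs: Sequence[float],
              seg_len: int) -> List[float]:
    """Unnormalised Welch-style PSD: average of Hann-windowed periodograms
    over 50%-overlapping segments, evaluated at the requested frequencies."""
    hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (seg_len - 1))
            for n in range(seg_len)]
    step = max(1, seg_len // 2)
    segments = [[w * v for w, v in zip(hann, x[start:start + seg_len])]
                for start in range(0, len(x) - seg_len + 1, step)]
    psd = []
    for f in freqs:
        omega = 2.0 * math.pi * f / fs
        cos_t = [math.cos(omega * n) for n in range(seg_len)]
        sin_t = [math.sin(omega * n) for n in range(seg_len)]
        total = 0.0
        for seg in segments:
            re = sum(v * c for v, c in zip(seg, cos_t))
            im = sum(v * q for v, q in zip(seg, sin_t))
            total += re * re + im * im
        psd.append(total / len(segments))
    return psd


def estimate_hr(x: Sequence[float], fs: float, f_lo: float = 0.7,
                f_hi: float = 4.0, df: float = 0.01) -> float:
    """HR in bpm: 60 times the frequency of the PSD peak in [f_lo, f_hi]."""
    clean = moving_average(detrend_linear(x), 5)
    n_freqs = int(round((f_hi - f_lo) / df)) + 1
    freqs = [f_lo + k * df for k in range(n_freqs)]
    psd = welch_psd(clean, fs, freqs, seg_len=len(clean) // 2)
    peak = max(range(n_freqs), key=lambda k: psd[k])
    return 60.0 * freqs[peak]


# Synthetic 10-second trace at 30 FPS: a 1.2 Hz pulse plus a slow drift.
fs = 30.0
signal = [0.05 * n + math.sin(2.0 * math.pi * 1.2 * n / fs)
          for n in range(300)]
hr = estimate_hr(signal, fs)
```

For the synthetic trace the estimate should land close to 72 bpm (1.2 Hz times 60). With only 2 seconds of data the spectral peak is much broader, which is one reason why repeated readings are combined in the histogram analysis.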


2.3. Histogram analysis

The HR value obtained from a single sequence calculation is still subject to variation, and hence
a histogram-based analysis is performed to determine the reading that is consistent across repetitions.
In this paper, 10 repetitions of the HR reading from the same image sequence are performed. Any HR reading
that differs significantly from the majority in the bins is labelled as an outlier and eliminated, while
the remaining readings are averaged to give an HR that is consistent over the sampled time period.
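One possible reading of this histogram step is sketched below. The bin width of 5 bpm and the rule of keeping the most populated bin together with its two neighbouring bins are assumptions, since the text does not specify them; the ten readings are hypothetical.

```python
from typing import Dict, List, Sequence


def histogram_hr(readings: Sequence[float], bin_width: float = 5.0) -> float:
    """Average the readings in the most populated bin and its direct
    neighbours; readings in all other bins are treated as outliers."""
    bins: Dict[int, List[float]] = {}
    for r in readings:
        bins.setdefault(int(r // bin_width), []).append(r)
    mode_bin = max(bins, key=lambda b: len(bins[b]))
    kept = [r for b, values in bins.items()
            if abs(b - mode_bin) <= 1 for r in values]
    return sum(kept) / len(kept)


# Ten hypothetical repeated readings (bpm) with two outliers.
readings = [72, 74, 71, 73, 118, 72, 75, 70, 55, 73]
hr_estimate = histogram_hr(readings)
```

Here the readings of 118 bpm and 55 bpm fall far from the dominant bin and are discarded, and the remaining eight readings average to 72.5 bpm.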


3. RESULTS AND ANALYSIS 

In the experiment, video recording was conducted in an indoor laboratory with normal ambient light
as the lighting source. The videos were recorded at a resolution of 1440x1080 pixels and 60 FPS.
The camera-subject distances were 1 meter, 3 meters and 5 meters, as shown in Figure 6.
A pulse oximeter was attached to the participant's finger, and the reading from this device was used as
the ground truth for this project. Two experiments were conducted for this paper. The first experiment
determined the minimum duration required of the face video recording. For this analysis,
2-second, 5-second and 10-second videos were used. The second experiment determined
the effect of using compressed video on the HR accuracy. The original video format is mov whereas the
compressed video format is wmv, and a fair comparison was made between the HR results obtained from the
original and compressed videos. For this project, the error percentage and the accuracy were calculated
using (3) and (4).


$\text{Percent of Error} = \frac{|\text{measured value} - \text{actual value}|}{\text{actual value}} \times 100\%$ (3)

$\text{Accuracy (\%)} = 100\% - \text{Percent of Error}$ (4)
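As a worked example of (3) and (4), the sketch below computes the accuracy for the eight 2-second readings at 1 meter listed in Table 1. It assumes the error is taken as an absolute value and that the per-distance figure in Table 2 is the mean of the per-subject accuracies; the function names are introduced here for illustration.

```python
def percent_error(measured: float, actual: float) -> float:
    """Equation (3): absolute error relative to the ground truth, in %."""
    return abs(measured - actual) / actual * 100.0


def accuracy(measured: float, actual: float) -> float:
    """Equation (4): accuracy in %."""
    return 100.0 - percent_error(measured, actual)


# (ground truth, 2-second reading) in bpm for subjects 1 to 8 at 1 meter,
# taken from Table 1.
pairs_1m_2s = [(91, 86), (65, 91), (70, 88), (69, 76),
               (61, 97), (67, 76), (80, 97), (56, 70)]

mean_accuracy = sum(accuracy(measured, actual)
                    for actual, measured in pairs_1m_2s) / len(pairs_1m_2s)
print(round(mean_accuracy, 1))  # 75.0
```

Under these assumptions the result is consistent with the 75.00% reported in Table 2 for 1 meter and 2 seconds.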


Figure 6. General experiment set up for this project (labels: RGB camera, ambient light, up to several meters)


Based on the readings in Table 1 and Table 2, the average error calculated for each recording time and
distance did not exceed 50%, which means that the system is capable of working even with a short video
sampling time. It can also be seen that, across the three distances, the 2-second sampling time provides
the most accurate readings: 75% at 1 meter, 94% at 3 meters and 79.9% at 5 meters. As the sampling time
increased (5 seconds and 10 seconds), the accuracy dropped, averaging below 80%
across the distances. Thus, these results indicate that the system is able to work
with acceptable accuracy with two seconds of sampling time. It is also worth mentioning that most
commercial pulse oximeter devices require more than 5 seconds to obtain the HR reading.
For the non-contact camera-based system, however, our results showed that a 2-second reading is enough.

Even though the accuracy in Table 1 and Table 2 is relatively high (above 80%), those results were
obtained using the compressed video format (wmv) to speed up the processing. In the next experiment, we
examined the effect of uncompressed video on the HR reading. In principle, uncompressed video
increases the processing time since the density of the pixels in the picture fragment is slightly higher.
For this analysis, original images from the video frames (mov format) were used and compared
with the compressed ones (wmv). The sampling time was two seconds for both video formats, since
we showed previously that this is the optimal sampling time for calculating the HR. Technically, 120 image
frames from the video (2 seconds at 60 FPS) were used as the input stream to the system. The results of
this analysis are shown in Table 3.

Based on the results in Table 3, the HR readings from the original video showed
great consistency and did not vary much from the ground truth. The experiment was repeated five times
to determine the variation of the estimated HR. For the experiment that used the compressed video,
the HR readings fluctuate. For example, for the compressed video at 3 meters,
the first and second readings differ considerably (82 bpm and 118 bpm). In contrast, there
was no fluctuation in the results obtained from the original video. This shows that using compressed video
as input for this system affected the accuracy of the HR reading. A likely cause is that converting the
video format compresses the video, and information lost during compression
makes the HR reading inconsistent and inaccurate.
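The fluctuation described above can be quantified with a short sketch that uses the 1 meter rows of Table 3. The choice of the population standard deviation as the spread measure, and the helper name `summarize`, are assumptions made for illustration.

```python
import statistics
from typing import Sequence, Tuple

ground_truth = 65                      # bpm, 1 meter, from Table 3
compressed = [92, 94, 94, 101, 91]     # wmv readings, five repetitions
original = [67, 67, 67, 67, 67]        # mov readings, five repetitions


def summarize(readings: Sequence[float],
              truth: float) -> Tuple[float, float, float]:
    """Return (mean reading, spread, percent error of the mean)."""
    mean = statistics.mean(readings)
    spread = statistics.pstdev(readings)
    error = abs(mean - truth) / truth * 100.0
    return mean, spread, error


c_mean, c_spread, c_error = summarize(compressed, ground_truth)
o_mean, o_spread, o_error = summarize(original, ground_truth)
```

For these rows the compressed readings average 94.4 bpm with a spread of about 3.5 bpm and an error of roughly 45%, whereas the original readings are constant at 67 bpm with an error of about 3%.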


Table 1. HR reading (bpm) for assigned time

                                            HR reading for respective duration (bpm)
Subject  Distance (m)  Ground truth (bpm)   2 seconds   5 seconds   10 seconds
1        1             91                   86          70          a9
1        3             94                   70          94          65
1        5             94                   85          111         116
2        1             65                   91          115         88
2        3             1                    61          92          56
2        5             75                   89          76          82
3        1             70                   88          92          109
3        3             65                   71          71          72
3        5             69                   104         77          64
4        1             69                   76          85          64
4        3             68                   76          85          71
4        5             70                   68          100         107
5        1             61                   97          82          82
5        3             56                   67          68          67
5        5             67                   107         86          58
6        1             67                   76          88          76
6        3             68                   62          92          id
6        5             66                   73          82          85
7        1             80                   97          67          61
7        3             68                   83          80          61
7        5             75                   67          67          97
8        1             56                   70          74          88
8        3             59                   64          aD          62
8        5             59                   62          92          86


Table 2. Accuracy summary from Table 1

             Accuracy (%) by sampling time (seconds)
Range (m)    2          5          10
(HR accuracy results for near distance)
1            75.00      66.40      70.00
3            94.00      80.60      84.40
(HR accuracy results for farther distance)
5            79.90      75.50      73.30
Average      82.97      74.17      75.90


Table 3. Bpm readings for compressed and original video with 2 seconds sampling time

                                                            HR reading per repetition (bpm)
Video type               Distance (m)  Ground truth (bpm)   1     2     3     4     5
Compressed video (wmv)   1             65                   92    94    94    101   91
Compressed video (wmv)   3             75                   82    118   74    oa    124
Compressed video (wmv)   5             75                   68    79    69    74    98
Original video (mov)     1             65                   67    67    67    67    67
Original video (mov)     3             75                   74    74    74    74    74
Original video (mov)     5             75                   60    60    60    60    60


4. CONCLUSION AND FUTURE WORK

This paper investigates the sampling time requirements for calculating HR from compressed
and uncompressed video samples. The PPG signals were extracted from eight different subjects
recorded at 60 FPS from three distances of 1 meter, 3 meters and 5 meters, producing 24
video samples. In the first experiment, three sampling times were analyzed: 2 seconds, 5 seconds
and 10 seconds. Averaged over all distances, a sampling time of 2 seconds yields 83% accuracy,
and beyond this point the accuracy deteriorates to below 80%; hence we
conclude that a 2-second sampling time is optimal for measuring HR from a camera. In the second
experiment, we conclude that original uncompressed or high-quality videos yield accurate
and stable HR readings, but at the cost of longer processing time. In future, the system can be improved by
optimizing the processing time for real implementation, which can be done by using a faster facial point
extractor module. Apart from that, the accuracy of the system can be improved by tuning the signal
processing part to be more robust to motion and illumination artifacts.